Point 1
Description of data
We have two lists of length 12: asd_sel and td_sel. In other words, we have data from 12 Autism Spectrum Disorder (ASD) subjects and 12 Typically Developed (TD) subjects. Each slot of these lists contains a dataframe of size (145×116): the 116 columns correspond to different ROIs (regions of interest), while the 145 rows are the time series of blood-oxygenation levels.
The following are the ASD patients:
## [1] "caltech_0051472" "trinity_0050234" "trinity_0050235" "trinity_0050236"
## [5] "trinity_0050239" "trinity_0050240" "trinity_0050241" "trinity_0050246"
## [9] "trinity_0050251" "trinity_0050252" "trinity_0050253" "trinity_0050254"
The following are the TD patients:
## [1] "caltech_0051487" "caltech_0051492" "trinity_0050259" "trinity_0050263"
## [5] "trinity_0050266" "trinity_0050267" "trinity_0050270" "trinity_0050271"
## [9] "trinity_0051134" "trinity_0051139" "trinity_0051140" "trinity_0051141"
We notice the presence of two different laboratory names: “caltech” and “trinity”. If we take a look at the data from the two laboratories we can see…
## X2001 X2002 X2101 X2102 X2111
## 1 0.016220 -0.088112 -0.184244 -0.191907 -0.349391
## 2 0.640587 0.156047 -0.022470 -0.273405 -0.494944
## 3 1.219444 0.439307 0.077278 -0.282552 -0.474252
## 4 1.238555 0.502680 -0.007576 -0.316699 -0.206840
## 5 0.683578 0.326157 -0.197045 -0.444887 0.229124
## 6 0.032254 0.154728 -0.270736 -0.550203 0.641275
## X2001 X2002 X2101 X2102 X2111
## 1 16.89461 60.54634 256.77448 695.4240 462.82086
## 2 -151.77333 -692.66290 110.63844 691.9337 -284.11908
## 3 -660.06824 -1693.21729 -168.02286 529.7900 -1096.83008
## 4 -1286.52612 -2378.78613 -62.05585 487.2489 -1095.97534
## 5 -1485.25745 -2252.13892 630.16693 562.8931 21.13595
## 6 -909.00568 -1319.04907 1416.94885 383.0337 1564.66504
The values seem to be very different; to quantify this, let’s look at their ranges:
## min max
## range_caltech -6.65 5.98
## range_trinity -14178.98 13774.48
We have to standardize the two types of data coming from different labs.
The most common approach to data scaling involves calculating the mean and standard deviation of each variable and using these values to scale the data.
Along with the significant differences across the two labs, we also notice that the data (from both labs) show several outliers. These are values on the edge of the distribution that may have a low probability of occurrence, yet are overrepresented for some reason. Outliers can skew a probability distribution and make scaling via standardization difficult, as the calculated mean and standard deviation will be distorted by their presence.
Let’s check whether our data contain outlier values. One way to quickly spot them is the box-and-whisker plot.
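In R, the points that a box-and-whisker plot would mark as outliers can be extracted directly; a minimal sketch on simulated data (not the fMRI data):

```r
# boxplot.stats() returns, in $out, the points beyond 1.5 * IQR from the box,
# i.e. exactly the points a box-and-whisker plot draws as outliers.
set.seed(42)
x <- c(rnorm(200), 8, -7)      # toy data with two artificial outliers
out <- boxplot.stats(x)$out
out                            # contains 8 and -7 (plus any extreme draws)
```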
Here are a few examples of the distributions of the ROIs of some patients.
Red points indicate outliers, and there are several of them. For this reason we decided to use a type of standardization different from the traditional one (subtracting the mean and dividing by the standard deviation). One approach to standardizing input variables in the presence of outliers is to use location and scale statistics that are insensitive to the outliers, and then use those values to scale the variables.
This is called robust standardization or robust data scaling.
This can be achieved by calculating the median (50th percentile) and the 25th and 75th percentiles. The values of each variable then have their median subtracted and are divided by the interquartile range (IQR) which is the difference between the 75th and 25th percentiles.
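This robust scaling can be sketched in a few lines of R (the toy column names are ours, not from the dataset):

```r
# Robust scaling: subtract the median and divide by the IQR, column by column.
robust_scale <- function(df) {
  as.data.frame(lapply(df, function(x) (x - median(x)) / IQR(x)))
}

set.seed(1)
toy <- data.frame(roi_a = c(rnorm(100), 50),  # one extreme outlier
                  roi_b = runif(101))
scaled <- robust_scale(toy)
# After scaling, each column has median 0 and IQR 1 by construction.
sapply(scaled, median)
```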
Within each group (ASD and TD), we standardize the data on a per-lab, per-ROI basis: we compute the median and the IQR along the columns of the two distinct labs and then use them to center and scale the data.
We show some standardized rows and ROIs of an ASD patient (trinity_0050234).
## 2001 2002 2101 2102 2111
## 1 0.02859976 0.01379819 0.18541838 0.5603610 0.2118784
## 2 -0.10477141 -0.46940068 0.08712572 0.5575796 -0.1464323
## 3 -0.50669657 -1.11127641 -0.10030485 0.4283692 -0.5362937
## 4 -1.00205700 -1.55108260 -0.02903031 0.3944687 -0.5358837
## 5 -1.15920026 -1.46983587 0.43656615 0.4547487 0.0000000
## 6 -0.70353942 -0.87123999 0.96576409 0.3114208 0.7404384
We show some standardized rows and ROIs of a TD patient (trinity_0050259).
## 2001 2002 2101 2102 2111
## 1 -0.5947603 0.01415512 -0.7694976 0.383235978 -0.9520271
## 2 -0.1768262 0.82495630 -1.0922181 0.684661414 -2.0242091
## 3 0.4704167 1.35021402 -0.6045179 0.966106375 -1.9566174
## 4 1.0455311 0.96204716 0.2739100 1.087092996 -0.5614404
## 5 1.3092391 -0.16527019 0.7263494 0.806177813 1.0519410
## 6 1.1095197 -1.15787332 0.4570422 0.002165688 1.5448142
We show an example of the new ranges for two ASD subjects coming from different laboratories.
## min max
## range_caltech -3.11000 3.400000
## range_trinity -4.30962 3.819349
Now the ranges seem to be similar.
Now that we have standardized the data, we can pool them into a single dataset for each group.
We recall that we have 12 subjects per group, i.e. 24 datasets in total. In the initial setting we face a high-dimensional problem, because n \(\thickapprox\) D and the protocol will not work since \(\hat{\Sigma_n}\) is not invertible. There are ways around this issue, but we prefer to work in a low-dimensional setting. Since the time series are iid and the data coming from different subjects may also be considered independent of each other, we simply joined the 12 datasets of each group into one with the same columns (features).
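The pooling step amounts to a row-wise stacking of the 12 data frames; a sketch with placeholder data in place of the real asd_sel list:

```r
# Pool the 12 per-subject data frames of one group into a single dataset.
# `asd_sel` is rebuilt here with random placeholder data, only to show the
# mechanics; the real list holds the standardized fMRI data frames.
asd_sel <- replicate(12,
                     as.data.frame(matrix(rnorm(145 * 116), 145, 116)),
                     simplify = FALSE)
asd_pooled <- do.call(rbind, asd_sel)   # stack row-wise, same 116 columns
dim(asd_pooled)                         # 1740 rows, 116 columns
```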
Dimension of the dataset for group ASD
## [1] 1740 116
Dimension of the dataset for group TD
## [1] 1740 116
The output is two datasets with 1,740 rows and 116 columns; now we are dealing with a low-dimensional setting, because n >> D, and we can estimate \(R[j,k]\) using a transformation of \(\hat{\Lambda} = \hat{\Sigma_n}^{-1}\).
Point 2
Our goal is to get two separate estimates, \(\hat{G}^{ASD}(t)\) and \(\hat{G}^{TD}(t)\) and to do that we need to build the 95% asymptotic confidence intervals for \(\{\rho_{j,k}^{ASD}\}_{j,k}\) and \(\{\rho_{j,k}^{TD}\}_{j,k}\). The next step will be setting a threshold t equal to a meaningfully high percentile of the correlation values observed in the two groups to build the adjacency matrices for our estimated graphs.
We start by computing the point estimate of the Pearson correlation for ASD and TD patients.
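The point estimate is simply the sample correlation matrix of the pooled data; a sketch on placeholder data (the real call would be on the 1740 × 116 pooled dataset):

```r
# Point estimate: the sample Pearson correlation matrix of the pooled data.
set.seed(7)
pooled <- as.data.frame(matrix(rnorm(1740 * 5), 1740, 5))  # placeholder data
names(pooled) <- c("X2001", "X2002", "X2101", "X2102", "X2111")
R_hat <- cor(pooled)          # 5 x 5 here; 116 x 116 on the real data
round(R_hat[1:3, 1:3], 3)
```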
Here, the first 6 rows of the ASD correlation matrix are shown.
## X2001 X2002 X2101 X2102 X2111
## 2001 1.00000000 0.41661588 0.0770486 -0.2819014 -0.06476566
## 2002 0.41661588 1.00000000 -0.2035541 -0.1353690 -0.07822295
## 2101 0.07704860 -0.20355413 1.0000000 0.2991563 0.42901589
## 2102 -0.28190140 -0.13536902 0.2991563 1.0000000 0.19518981
## 2111 -0.06476566 -0.07822295 0.4290159 0.1951898 1.00000000
## 2112 -0.23698061 -0.21314679 0.1013103 0.3885522 0.42378444
We want to show an example of the correlations that are in this matrix. The following plot shows the correlations between the ROI 2002 and all the others.
Obviously, the last correlation is between the feature and itself.
Here, the first 6 rows of the TD correlation matrix are shown.
## X2001 X2002 X2101 X2102 X2111
## 2001 1.000000000 0.36952652 0.07123769 -0.3096921 -0.003652585
## 2002 0.369526523 1.00000000 -0.16721060 -0.2348941 -0.073307964
## 2101 0.071237690 -0.16721060 1.00000000 0.2870047 0.278154727
## 2102 -0.309692109 -0.23489413 0.28700472 1.0000000 0.014938000
## 2111 -0.003652585 -0.07330796 0.27815473 0.0149380 1.000000000
## 2112 -0.133967760 -0.23439068 0.18372003 0.3346879 0.493386737
We want to show an example of the correlations that are in this matrix. The following plot shows the correlations between the ROI 2002 and all the others.
Obviously, the last correlation is between the feature and itself.
Lower bound of the confidence intervals for the ASD Pearson correlation coefficients
## X2001 X2002 X2101 X2102 X2111
## 2001 1.00000000 0.3240087 -0.030259754 -0.37756523 -0.17064124
## 2002 0.32400868 1.0000000 -0.303988297 -0.23896459 -0.18374296
## 2101 -0.03025975 -0.3039883 1.000000000 0.19845350 0.33745648
## 2102 -0.37756523 -0.2389646 0.198453495 1.00000000 0.09001202
## 2111 -0.17064124 -0.1837430 0.337456479 0.09001202 1.00000000
## 2112 -0.33552677 -0.3130617 -0.005811493 0.29371122 0.33177842
Upper bound of the confidence intervals for the ASD Pearson correlation coefficients
## X2001 X2002 X2101 X2102 X2111
## 2001 1.00000000 0.50131480 0.18260113 -0.18028360 0.04258839
## 2002 0.50131480 1.00000000 -0.09864508 -0.02872658 0.02907932
## 2101 0.18260113 -0.09864508 1.00000000 0.39360884 0.51253400
## 2102 -0.18028360 -0.02872658 0.39360884 1.00000000 0.29606182
## 2111 0.04258839 0.02907932 0.51253400 0.29606182 1.00000000
## 2112 -0.13330389 -0.10856541 0.20613329 0.47581792 0.50780421
As an example, using a function we wrote, we show the interval for the coefficient \(\rho_{2102,2112}^{ASD}\):
## [1] 0.2937112 0.4758179
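Our interval function is not reproduced here, but a sketch of the standard construction (Fisher’s z-transform with a Bonferroni-adjusted level; the function name `rho_ci` is ours) is:

```r
# 95% asymptotic CI for a Pearson correlation via Fisher's z-transform,
# Bonferroni-corrected over the m = choose(D, 2) simultaneous intervals.
rho_ci <- function(r, n, D, alpha = 0.05) {
  m  <- choose(D, 2)             # 6670 pairs when D = 116
  z  <- atanh(r)                 # Fisher transform
  se <- 1 / sqrt(n - 3)
  q  <- qnorm(1 - (alpha / m) / 2)
  tanh(c(lower = z - q * se, upper = z + q * se))
}

# Plugging in the sample correlation of the pair (2102, 2112), r = 0.3886:
rho_ci(r = 0.3886, n = 1740, D = 116)
```

With these inputs the sketch reproduces an interval very close to the one printed above.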
Lower bound of the confidence intervals for the TD Pearson correlation coefficients
## X2001 X2002 X2101 X2102 X2111
## 2001 1.00000000 0.2732790 -0.03609636 -0.40337680 -0.11066806
## 2002 0.27327897 1.0000000 -0.26944591 -0.33356458 -0.17896218
## 2101 -0.03609636 -0.2694459 1.00000000 0.18565033 0.17634740
## 2102 -0.40337680 -0.3335646 0.18565033 1.00000000 -0.09226831
## 2111 -0.11066806 -0.1789622 0.17634740 -0.09226831 1.00000000
## 2112 -0.23761849 -0.3330910 0.07819937 0.23608848 0.40787236
Upper bound of the confidence intervals for the TD Pearson correlation coefficients
## X2001 X2002 X2101 X2102 X2111
## 2001 1.00000000 0.45844857 0.17694694 -0.2095821 0.10344662
## 2002 0.45844857 1.00000000 -0.06124827 -0.1311330 0.03401777
## 2101 0.17694694 -0.06124827 1.00000000 0.3823163 0.37407397
## 2102 -0.20958210 -0.13113304 0.38231629 1.0000000 0.12180196
## 2111 0.10344662 0.03401777 0.37407397 0.1218020 1.00000000
## 2112 -0.02730056 -0.13060938 0.28516983 0.4264658 0.57032039
As an example, using a function we wrote, we show the interval for the coefficient \(\rho_{2102,2112}^{TD}\):
## [1] 0.2360885 0.4264658
Building the adjacency matrices to represent the estimated graphs
Once we get a (1 - \(\alpha/m\)) confidence interval (where \(m = \binom{D}{2}\)) \(C_{n}^{j,k}(\alpha)\) for \(\rho(j,k)\), we can then place an edge between feature \(j\) and feature \(k\) whenever \([-t,+t] \cap C_{n}^{j,k}(\alpha) = \emptyset\) (the empty set).
Remember that since the adjacency matrix is symmetric, every edge appears twice, in position \((j,k)\) and position \((k,j)\), so for the effective number of edges we must count each pair only once.
We set the diagonal to zero to exclude the trivial correlation of 1 between each feature and itself.
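The edge rule can be sketched as follows: given matrices `lower` and `upper` of CI bounds (names ours), an edge appears exactly when \([-t,t]\) and the interval are disjoint, i.e. when the lower bound exceeds \(t\) or the upper bound is below \(-t\).

```r
# Adjacency matrix from the CI bound matrices: edge iff [-t, t] and the
# interval [lower, upper] are disjoint.
adjacency_from_ci <- function(lower, upper, t) {
  A <- 1 * ((lower > t) | (upper < -t))   # logical -> 0/1
  diag(A) <- 0                            # drop the trivial self-correlation
  A
}

# Toy 3 x 3 example: pair (1,2) clears t, pair (1,3) clears -t, (2,3) neither.
lo <- matrix(c(1.0,  0.6, -0.9,
               0.6,  1.0,  0.1,
              -0.9,  0.1,  1.0), nrow = 3, byrow = TRUE)
hi <- lo + 0.1
A  <- adjacency_from_ci(lo, hi, t = 0.5)
sum(A) / 2    # each edge is counted twice in a symmetric matrix
```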
Now we have to select a threshold t via trial-and-error using a meaningfully high percentile of the correlation values observed in the two groups.
We start from the interpretation of the correlation coefficient:
Several approaches have been suggested to translate the correlation coefficient into descriptors like “weak,” “moderate,” or “strong” relationship (see the Table for an example). These cutoff points are arbitrary and inconsistent and should be used judiciously.
We start by showing the quartiles of the two groups, we consider the absolute value of the correlation coefficients.
## asd td
## 25% 0.050 0.051
## 50% 0.103 0.109
## 75% 0.180 0.190
We can see that for both groups 75% of the correlation values are lower than 0.18 (ASD group) and 0.19 (TD group). This means that most values fall into the “weak” correlation category. We are interested in setting a meaningfully high value to identify the most correlated ROIs. We create a sequence of percentiles from 0.75 to 1 and select the one whose corresponding correlation value is at least 0.5, which is commonly taken as the threshold for a moderate correlation.
## asd td
## 75% 0.180 0.190
## 77.5% 0.192 0.202
## 80% 0.205 0.215
## 82.5% 0.219 0.228
## 85% 0.235 0.247
## 87.5% 0.256 0.268
## 90% 0.285 0.292
## 92.5% 0.322 0.328
## 95% 0.384 0.391
## 97.5% 0.510 0.504
## 100% 1.000 1.000
From the table above we chose the 97.5% percentile, because it corresponds to a threshold greater than 0.5, which is meaningfully high in our opinion.
To see how the number of edges changes with t, we will also try setting t twice with a weak correlation threshold, twice with a moderate one and twice with a strong one. We avoid negligible and very strong correlation values, because they would produce an excessive number of edges or none at all.
The selected values are 0.2, 0.3, 0.5, 0.6, 0.7 and 0.8. We now show how the number of edges changes as t varies.
## edges_asd edges_td
## t = 0.20 537 578
## edges_asd edges_td
## t = 0.30 257 272
## edges_asd edges_td
## t = 0.50 54 46
## edges_asd edges_td
## t = 0.60 21 16
## edges_asd edges_td
## t = 0.70 5 6
## edges_asd edges_td
## t = 0.80 0 0
Since our goal is to select a meaningfully high threshold, we discarded the weak correlation values. The strong correlation values give too few edges: only a handful of features are that highly correlated. We select a threshold at which we can claim a moderate linear dependency between features; our choice is \(t = 0.5\).
## edges_asd edges_td
## t = 0.50 54 46
With this configuration, using the adjacency matrices, we found 54 edges for ASD and 46 edges for TD: a small difference in quantity.
We have reached the two separate estimates, \(\hat{G}^{ASD}(t)\) and \(\hat{G}^{TD}(t)\), of the true association graphs, and for now we show them through their adjacency matrices.
Adjacency matrix for \(\hat{G}^{ASD}(0.50)\)
## X2111 X2112 X2201 X2202 X2211 X2212
## 2001 0 0 0 0 0 0
## 2002 0 0 0 0 0 0
## 2101 0 0 1 0 0 0
## 2102 0 0 0 1 0 0
## 2111 0 0 0 0 1 0
## 2112 0 0 0 0 0 0
Adjacency matrix for \(\hat{G}^{TD}(0.50)\)
## X2611 X2612 X2701 X2702 X3001 X3002
## 2001 0 0 0 0 0 0
## 2002 0 0 0 0 0 0
## 2101 0 0 0 0 0 0
## 2102 0 0 0 0 0 0
## 2111 0 0 0 0 0 0
## 2112 0 0 1 0 0 0
Point 3
We want to represent the estimated graphs \(\hat{G}^{ASD}(0.50)\) and \(\hat{G}^{TD}(0.50)\) using igraph and ggraph packages.
## IGRAPH 11953e0 UN-- 116 54 --
## + attr: name (v/c)
## + edges from 11953e0 (vertex names):
## [1] 2101--2201 2102--2202 2102--2602 2111--2211 2112--2612 2112--2701
## [7] 2112--2702 2211--2212 2311--2312 2321--2322 2332--3002 2332--8102
## [13] 2501--2502 2601--2602 2602--2612 2611--2612 2612--2702 2701--2702
## [19] 3001--7011 4001--4002 4011--4012 4111--4112 5001--5002 5001--5021
## [25] 5002--5021 5002--5022 5011--5012 5011--5102 5021--5022 5101--5202
## [31] 6101--6201 6211--6212 7011--7021 7012--7021 8111--8112 8211--8212
## [37] 9002--9012 9022--9032 9022--9100 9022--9110 9032--9100 9032--9110
## [43] 9032--9130 9042--9130 9042--9140 9061--9062 9062--9071 9062--9072
## + ... omitted several edges
## IGRAPH 8b6c3da UN-- 116 46 --
## + attr: name (v/c)
## + edges from 8b6c3da (vertex names):
## [1] 2112--2701 2201--2311 2301--2311 2302--2312 2331--8111 2332--3002
## [7] 2332--8102 2601--2602 2611--2612 2612--2702 2701--2702 3001--7011
## [13] 3002--7012 3002--8102 4001--4002 4011--4012 4021--4022 5001--5002
## [19] 5001--5011 5001--5021 5001--5022 5002--5022 5011--5012 5011--5102
## [25] 5012--5102 5021--5022 5101--5202 5102--5202 6001--6002 6101--6201
## [31] 6201--6202 6211--6212 6401--6402 7012--7021 7012--7101 8111--8112
## [37] 9001--9002 9002--9012 9012--9052 9022--9032 9022--9100 9022--9110
## [43] 9032--9100 9032--9110 9042--9130 9160--9170
To achieve a better visualization and see the edges among the nodes more easily, we use another layout.
Now we can clearly see whether two nodes share an edge, which implies the correlation between the two ROIs is greater than 0.5.
We use the alternative layout for the second graph as well.
Again, we can clearly see whether two nodes share an edge, which implies the correlation between the two ROIs is greater than 0.5.
To draw conclusions and understand whether there are clear co-activation differences between the two groups, we will build another graph, \(\hat{G}^{\Delta}(t)\), which has an edge between vertices \(j\) and \(k\) if \(|{\Delta^{j,k}}| = |\rho_{j,k}^{ASD} - \rho_{j,k}^{TD}| \geq t\).
We built the confidence intervals for the differences between the two groups. To estimate the difference graph, we set a threshold indicating when a co-activation difference is relevant. The threshold t is not, in this case, a correlation value, but an index in the range [0,2] (the maximum possible difference is 2). Using the trial-and-error approach again, we explore how the graph configuration changes as the threshold t varies.
Once we get a (1 - \(\alpha/m\)) confidence interval (where \(m = \binom{D}{2}\)) \(C_{n}^{j,k}(\alpha)\) for \(\rho(j,k)\), we can then place an edge between feature \(j\) and feature \(k\) whenever \([-t,t] \cap C_{n}^{j,k}(\alpha) = \emptyset\) (the empty set).
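A sketch of one standard asymptotic construction for the interval of the difference \(\Delta^{j,k}\) (delta method, assuming the two groups are independent so the variances add; the function name `diff_ci` and the plugged-in values are ours, for illustration only):

```r
# Bonferroni-adjusted asymptotic CI for the difference of two independent
# Pearson correlations, using Var(r) ~ (1 - r^2)^2 / n (delta method).
diff_ci <- function(r_asd, r_td, n_asd, n_td, D, alpha = 0.05) {
  m  <- choose(D, 2)
  se <- sqrt((1 - r_asd^2)^2 / n_asd + (1 - r_td^2)^2 / n_td)
  q  <- qnorm(1 - (alpha / m) / 2)
  d  <- r_asd - r_td
  c(lower = d - q * se, upper = d + q * se)
}

# Illustrative values (not taken from the report):
diff_ci(r_asd = 0.39, r_td = 0.33, n_asd = 1740, n_td = 1740, D = 116)
```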
Now, we show how many edges are formed using a high enough value of t.
## number_of_edges_diff
## t = 1.5 0
## t = 1.35 0
## t = 1.2 0
## t = 1 0
## t = 0.8 0
## t = 0.5 0
## t = 0.3 0
## t = 0.2 2
Although the range of t is \([0,2]\), we can see that for large values of t there are no edges indicating a clear co-activation difference between ASD and TD patients. Decreasing the threshold to 0.2, we obtain the first two edges that indicate a difference.
## IGRAPH 7d8ae72 UN-- 116 2 --
## + attr: name (v/c)
## + edges from 7d8ae72 (vertex names):
## [1] 9062--9082 9071--9082
In the graph above we visualize the only two differences, using confidence intervals built with \(t = 0.2\). We can say that between ASD and TD patients the ROI co-activations showing a difference are 9062–9082 and 9071–9082.
To find the most relevant differences we have to decrease the threshold further, because we noticed that, building confidence intervals with the Bonferroni correction, there are no big differences between the two groups in terms of correlations. We want to show the 20 most relevant differences and the associated ROIs. After some tests we found that the threshold to use is 0.0822.
## number_of_edges_diff
## t = 0.0822 20
Here we can see the 20 most relevant differences in terms of co-activation.
## IGRAPH f127965 UN-- 116 20 --
## + attr: name (v/c)
## + edges from f127965 (vertex names):
## [1] 2001--9002 2102--4021 2311--2502 3002--9062 5102--9051 5201--9170
## [7] 7021--9002 7021--9041 7022--9012 7101--9002 7101--9041 7102--8211
## [13] 8101--9031 8101--9110 8102--9150 9062--9082 9071--9082 9072--9081
## [19] 9072--9082 9081--9120
If, instead of selecting the threshold based on the 20 most relevant differences, we wanted to use percentiles again, we would need a statistical measure indicating the point beyond which a difference is relevant, based on its own distribution. We opt for the median, which splits the distribution into two equal parts, and consider relevant all the differences to its right.
Visualizing the distribution of absolute differences between the correlation values of ASD and TD and the median.
## number_of_edges_diff
## t = 0.061 40
The graph above shows the 40 differences that are relevant with the median chosen as threshold.
As we did for the differences between correlations, we now raise the threshold t on the individual graphs to find which connections are the most resilient in terms of co-activation correlation. Let’s start the competition!
We start from the previously chosen threshold, namely 0.50. Our purpose is to find the three most resilient connections for the two groups.
## number_of_resilient
## t = 0.55 29
## t = 0.60 21
## t = 0.65 9
## t = 0.70 5
## t = 0.75 2
## t = 0.8 0
And searching between 0.70 and 0.75 we discovered that…
## number_of_resilient
## t = 0.725 3
The first position is taken by the connection between ROI 5001 and ROI 5002.
We start from the previously chosen threshold, namely 0.50. Our purpose is to find the three most resilient connections for the two groups.
## number_of_resilient
## t = 0.55 34
## t = 0.60 16
## t = 0.65 9
## t = 0.70 6
## t = 0.75 1
## t = 0.8 0
And searching between 0.70 and 0.75 we discovered that…
## number_of_resilient
## t = 0.7362 3
The first position is taken again by the connection between ROI 5001 and ROI 5002. It is the most resilient in both groups.
We now repeat the estimation of the difference graph without the Bonferroni correction. As before, we set a threshold indicating when a co-activation difference is relevant; t is not a correlation value but an index in the range [0,2] (the maximum possible difference is 2), and using trial and error we explore how the graph configuration changes as t varies.
Once we get a (1 - \(\alpha\)) confidence interval \(C_{n}^{j,k}(\alpha)\) for \(\rho(j,k)\) (this time with no \(\alpha/m\) adjustment), we can then place an edge between feature \(j\) and feature \(k\) whenever \([-t,t] \cap C_{n}^{j,k}(\alpha) = \emptyset\) (the empty set).
Now, we show how many edges are formed using a high enough value of t.
## number_of_edges_diff
## t = 1.5 0
## t = 1.35 0
## t = 1.2 0
## t = 1 0
## t = 0.8 0
## t = 0.5 0
## t = 0.3 2
## t = 0.2 22
The first thing we notice is that here the differences are more relevant. With the Bonferroni correction we previously had only two differences at threshold \(t = 0.20\); now we have 20 more edges. There are also two differences that survive the threshold \(0.3\).
Although the range of t is \([0,2]\), for large values of t there are still no edges indicating a clear co-activation difference between ASD and TD patients. Decreasing the threshold to 0.3, we obtain the first two edges that indicate a difference.
## IGRAPH 247abcc UN-- 116 22 --
## + attr: name (v/c)
## + edges from 247abcc (vertex names):
## [1] 2001--9002 2102--4021 2311--2502 3002--9062 4012--9071 5102--9051
## [7] 5201--9170 7021--9002 7021--9041 7022--9012 7101--9002 7101--9041
## [13] 7102--8211 8101--9031 8101--9110 8102--9150 8301--9062 9062--9082
## [19] 9071--9082 9072--9081 9072--9082 9081--9120
In the graph above we visualize the 22 differences obtained using confidence intervals built with \(t = 0.2\) and no Bonferroni correction. The increase in the number of edges is visible graphically as well.
To find the most relevant differences we again have to decrease the threshold, because even without the Bonferroni correction there are no big differences between the two groups in terms of correlations. We want to show the 20 most relevant differences and the associated ROIs, but first we use the t previously selected in the Bonferroni case to show what happens if we skip this step.
## number_of_edges_diff
## t = 0.0822 439
For this t we previously had 20 edges; now we have 439 relevant differences. Let’s plot them!
Now we want to find the t at which only 20 relevant differences remain. After some tests we found that the threshold to use is 0.20225.
## number_of_edges_diff
## t = 0.20225 20
Here we can see the 20 most relevant differences without Bonferroni correction.
## IGRAPH 38a1fc2 UN-- 116 20 --
## + attr: name (v/c)
## + edges from 38a1fc2 (vertex names):
## [1] 2001--9002 2102--4021 2311--2502 3002--9062 5102--9051 5201--9170
## [7] 7021--9002 7021--9041 7022--9012 7101--9002 7101--9041 7102--8211
## [13] 8101--9031 8101--9110 8102--9150 9062--9082 9071--9082 9072--9081
## [19] 9072--9082 9081--9120
If, instead of selecting the threshold based on the 20 most relevant differences, we wanted to use percentiles again, we would need a statistical measure indicating the point beyond which a difference is relevant, based on its own distribution. We opt for the median, which splits the distribution into two equal parts, and consider relevant all the differences to its right.
## number_of_edges_diff
## t = 0.061 700
We observe that for the threshold 0.061, the median of the correlation differences, there are now 700 edges, whereas previously there were only 40. Remember that, since we are not controlling for multiplicity, it makes no statistical sense to comment on the overall topology of the graph: we can comment on the presence or absence of a single edge, but not on the co-occurrence of edges in the graph (…at the required 95% confidence level!).
The graph above shows the 700 differences that are relevant with the median chosen as threshold.
As we did for the differences between correlations, we now raise the threshold t on the individual graphs to find which connections are the most resilient in terms of co-activation correlation, this time without the Bonferroni correction. Let’s start the competition!
We start from the previously chosen threshold, namely 0.50. Our purpose is to find the three most resilient connections for the two groups.
## number_of_resilient
## t = 0.55 25
## t = 0.60 15
## t = 0.65 5
## t = 0.70 3
## t = 0.75 0
## t = 0.8 0
We discover that the three most resilient connections are spotted using a threshold equal to 0.70.
## number_of_resilient
## t = 0.7 3
Without the Bonferroni correction the ranking is the same, and the first position is again taken by the connection between ROI 5001 and ROI 5002.
We start from the previously chosen threshold, namely 0.50. Our purpose is to find the three most resilient connections for the two groups.
## number_of_resilient
## t = 0.55 31
## t = 0.60 13
## t = 0.65 6
## t = 0.70 4
## t = 0.75 0
## t = 0.8 0
And searching between 0.70 and 0.75 we discovered that…
## number_of_resilient
## t = 0.7005 3
The first position is again taken by the connection between ROI 5001 and ROI 5002: it is the most resilient in both groups, also without the Bonferroni correction.
Point 4
We want to repeat the analysis using the partial correlation coefficient. Since we are dealing with a low-dimensional setting (\(D < n = 1740\)) we can estimate \(R_{(p)}\) as follows. Let \(\hat{\Sigma_n}\) be the sample covariance and \(\hat{\Lambda} = \hat{\Sigma_n}^{-1}\) its inverse; we define \(\hat{R}_{(p)}[j,k] = \hat{\rho}_{j,k}^{(p)} = -\frac{\hat{\Lambda}_{j,k}}{\sqrt{\hat{\Lambda}_{j,j} \, \hat{\Lambda}_{k,k}}}\). We implemented this method by hand and then used the ppcor library (in particular the pcor function) to compare the results.
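A sketch of the by-hand computation on simulated placeholder data (on the real data one would pass the pooled 1740 × 116 matrix and compare the result against ppcor::pcor):

```r
# Partial correlations from the precision matrix Lambda = solve(Sigma_hat):
# rho_p[j, k] = -Lambda[j, k] / sqrt(Lambda[j, j] * Lambda[k, k]).
partial_cor <- function(X) {
  Lambda <- solve(cov(X))                        # inverse sample covariance
  P <- -Lambda / sqrt(outer(diag(Lambda), diag(Lambda)))
  diag(P) <- 1                                   # convention: 1 on the diagonal
  P
}

set.seed(3)
X <- matrix(rnorm(1740 * 6), 1740, 6)            # placeholder for pooled data
P <- partial_cor(X)
round(P[1:3, 1:3], 3)
```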
We show the first 6 rows and 5 columns of the partial correlation matrix of the ASD patients.
## X2001 X2002 X2101 X2102 X2111
## 2001 1.000 0.257 0.102 -0.051 0.009
## 2002 0.257 1.000 -0.126 0.045 0.019
## 2101 0.102 -0.126 1.000 0.120 0.149
## 2102 -0.051 0.045 0.120 1.000 0.038
## 2111 0.009 0.019 0.149 0.038 1.000
## 2112 0.097 -0.070 -0.107 0.034 0.279
We repeat the computation with the function pcor and compare the elements of the two matrices. Below we can see that the partial correlation values coincide.
## X2001 X2002 X2101 X2102 X2111
## 2001 TRUE TRUE TRUE TRUE TRUE
## 2002 TRUE TRUE TRUE TRUE TRUE
## 2101 TRUE TRUE TRUE TRUE TRUE
## 2102 TRUE TRUE TRUE TRUE TRUE
## 2111 TRUE TRUE TRUE TRUE TRUE
## 2112 TRUE TRUE TRUE TRUE TRUE
## 2201 TRUE TRUE TRUE TRUE TRUE
## 2202 TRUE TRUE TRUE TRUE TRUE
## 2211 TRUE TRUE TRUE TRUE TRUE
## 2212 TRUE TRUE TRUE TRUE TRUE
We show the first 6 rows and 5 columns of the partial correlation matrix of the TD patients.
## X2001 X2002 X2101 X2102 X2111
## 2001 1.000 0.182 0.058 -0.111 0.034
## 2002 0.182 1.000 -0.007 0.090 -0.053
## 2101 0.058 -0.007 1.000 0.062 -0.023
## 2102 -0.111 0.090 0.062 1.000 -0.047
## 2111 0.034 -0.053 -0.023 -0.047 1.000
## 2112 -0.041 -0.058 -0.088 0.153 0.257
We repeat the computation with the function pcor and compare the elements of the two matrices. Below we can see that the partial correlation values coincide.
## X2001 X2002 X2101 X2102 X2111
## 2001 TRUE TRUE TRUE TRUE TRUE
## 2002 TRUE TRUE TRUE TRUE TRUE
## 2101 TRUE TRUE TRUE TRUE TRUE
## 2102 TRUE TRUE TRUE TRUE TRUE
## 2111 TRUE TRUE TRUE TRUE TRUE
## 2112 TRUE TRUE TRUE TRUE TRUE
## 2201 TRUE TRUE TRUE TRUE TRUE
## 2202 TRUE TRUE TRUE TRUE TRUE
## 2211 TRUE TRUE TRUE TRUE TRUE
## 2212 TRUE TRUE TRUE TRUE TRUE
We show the first 6 rows and 5 columns of the partial correlation matrix of differences.
## X2001 X2002 X2101 X2102 X2111
## 2001 0.000 0.075 0.045 0.060 -0.025
## 2002 0.075 0.000 -0.119 -0.045 0.072
## 2101 0.045 -0.119 0.000 0.059 0.172
## 2102 0.060 -0.045 0.059 0.000 0.086
## 2111 -0.025 0.072 0.172 0.086 0.000
## 2112 0.137 -0.012 -0.019 -0.120 0.022
We have to build the estimated graphs, and to represent them we use the adjacency matrices. We used the same threshold chosen previously (0.50) to compare the results, and we immediately find a first difference.
## edges_asd edges_td
## t = 0.50 0 0
We can see that with a threshold of 0.50 there are no edges. The partial correlation of each pair of variables given the others measures the linear association between the two ROIs after removing the effect of the confounders (all the other ROIs). Since no edges form at the same threshold t (0.50), we can say that the previous correlation values, and consequently the edges they formed, were inflated by the effect of the remaining ROIs. Now we represent all the edges formed under the assumption that the corresponding Bonferroni-adjusted 95% confidence interval does not contain the value 0: we set \(t = 0\), and afterwards we will set a meaningfully high percentile of the observed partial correlation values.
## edges_asd edges_td
## t = 0.00 135 131
We can see that there are only about 130 edges; now we show the corresponding graphs.
Here we represent only the association edges whose confidence intervals do not include 0 as a possible correlation value, for the ASD group.
Here we represent only the association edges whose confidence intervals do not include 0 as a possible correlation value, for the TD group.
Now, we are going to select the threshold t for the partial case.
We start by showing the quartiles of the two groups, we consider the absolute value of the partial correlation coefficients.
## asd td
## 25% 0.022 0.022
## 50% 0.046 0.047
## 75% 0.081 0.083
We can see that for both groups 75% of the partial correlation values are lower than 0.081 (ASD group) and 0.083 (TD group). This means that most values fall into the “negligible” correlation category. We are interested in setting a meaningfully high value to identify the most correlated ROIs, while being aware that partial correlations can be very low. We create a sequence of percentiles from 0.75 to 1 and select the one whose correlation value is at least 0.2, because we already know that no edges form at a moderate threshold.
## asd td
## 75% 0.081 0.083
## 77.5% 0.086 0.088
## 80% 0.092 0.094
## 82.5% 0.097 0.100
## 85% 0.104 0.109
## 87.5% 0.113 0.117
## 90% 0.123 0.127
## 92.5% 0.140 0.140
## 95% 0.163 0.167
## 97.5% 0.246 0.235
## 100% 1.000 1.000
From the table above we chose the 97.5% percentile, because it corresponds to a threshold greater than 0.23, which is enough to identify moderate-to-weak correlations.
We build the adjacency matrices in order to graphically represent the estimated graphs.
## edges_asd edges_td
## t = 0.23 9 9
With this configuration, using the adjacency matrices, we found 9 edges for ASD and 9 edges for TD.
We have reached the two separate estimates, \(\hat{G}^{ASD}(0.23)\) and \(\hat{G}^{TD}(0.23)\), and now we represent them graphically using the igraph and ggraph packages.
## IGRAPH 3867094 UN-- 116 9 --
## + attr: name (v/c)
## + edges from 3867094 (vertex names):
## [1] 2401--2402 2611--2612 4001--4002 5001--5002 5011--5012 6101--6201 8111--8112
## [8] 8211--8212 9081--9082
## IGRAPH 8df61b5 UN-- 116 9 --
## + attr: name (v/c)
## + edges from 8df61b5 (vertex names):
## [1] 2611--2612 4001--4002 4021--4022 5001--5002 5011--5012 5022--9120 6211--6212
## [8] 6301--6302 8111--8112
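The report uses igraph to build these objects; the edge list behind the `IGRAPH` printout can also be recovered directly from a named adjacency matrix in base R, as in this toy sketch with hypothetical ROI labels:

```r
# Toy 4-ROI adjacency matrix with two edges (illustrative labels)
rois <- c("2401", "2402", "5001", "5002")
adj <- matrix(0, 4, 4, dimnames = list(rois, rois))
adj["2401", "2402"] <- adj["2402", "2401"] <- 1
adj["5001", "5002"] <- adj["5002", "5001"] <- 1

# Extract each edge once (upper triangle) as "name--name" pairs
idx <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
edges <- paste(rois[idx[, 1]], rois[idx[, 2]], sep = "--")
edges  # "2401--2402" "5001--5002"
```

`igraph::graph_from_adjacency_matrix(adj, mode = "undirected")` would produce the same edges in graph form.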
Now we can clearly see whether two nodes of the ASD graph share an edge, which implies that the absolute partial correlation between the two ROIs is greater than 0.23.
Now we can clearly see whether two nodes of the TD graph share an edge, which implies that the absolute partial correlation between the two ROIs is greater than 0.23.
We wanted to use percentiles again, so we chose a statistical measure that indicates a threshold beyond which a difference is relevant, based on its own distribution. We opted for the median, which cuts the distribution into two equal parts, so we consider relevant all the differences that lie to the right of this measure.
Below we visualize the distribution of the absolute differences between the partial correlation values of ASD and TD, together with the median.
We build the adjacency matrix and we show the number of edges of this configuration.
## number_of_edges_diff
## t = 0.056 11
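The mechanics of the median threshold can be sketched on simulated data; the vector `diff_vals` is a placeholder for the absolute ASD−TD differences considered in the report:

```r
# Placeholder absolute differences (simulated, not the real ASD-TD values)
set.seed(3)
diff_vals <- abs(rnorm(200, sd = 0.05))

t_med <- median(diff_vals)       # threshold = median of the distribution
relevant <- diff_vals > t_med    # keep differences right of the median
sum(relevant)
```

By construction the median leaves half of the candidate differences above it; in the report this selection, applied to the relevant differences, retains the 11 edges shown above.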
We want to represent graphically these co-activation differences:
The following are the only resulting edges:
## IGRAPH a88b719 UN-- 116 11 --
## + attr: name (v/c)
## + edges from a88b719 (vertex names):
## [1] 2101--2401 2321--3001 2321--7011 2331--8101 5011--6102 5301--6401
## [7] 6402--8101 7002--9001 8211--8302 9002--9071 9002--9110
We can see that the number of edges obtained from the partial correlations, using the median as threshold, is lower than the number of edges obtained from the correlation coefficients. Again we can affirm that the pairwise ROI correlations given all the others are lower than the pairwise correlations that include the effect of the other variables (ROIs).
Now we want to repeat the “competition” among ROIs using the partial correlation coefficient, to spot the last connections to die, i.e. the most “resilient” ones. Let’s start the competition!
We start from the threshold that we previously chose, namely 0.23. Our purpose is to reach the three most resilient connections for the two groups.
## number_of_resilient
## t = 0.23 9
## t = 0.26 4
## t = 0.30 3
## t = 0.33 2
## t = 0.37 1
## t = 0.40 0
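The threshold sweep above can be sketched as a simple loop over candidate thresholds, here on simulated placeholder values standing in for the absolute partial correlations:

```r
# Simulated stand-in for one group's absolute partial correlations
set.seed(9)
abs_pcor <- abs(rnorm(6670, sd = 0.08))

thresholds <- c(0.23, 0.26, 0.30, 0.33, 0.37, 0.40)
n_resilient <- sapply(thresholds, function(th) sum(abs_pcor > th))
names(n_resilient) <- paste("t =", thresholds)
n_resilient   # surviving edge count shrinks as the threshold grows
```

Raising the threshold can only remove edges, so the counts are non-increasing, which is what lets us bracket the value yielding exactly three survivors.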
We discovered that, to identify the three most resilient connections, we have to set the threshold at 0.30.
## number_of_resilient
## t = 0.3 3
In the plots above we can see the rankings of correlation and partial correlation for the ASD patients, used to identify the most resilient connections. The first position is always taken by 5001-5002, whereas the other positions change. For example, the connection in second position in the left plot falls to third when using the partial correlation.
We again start from the threshold that we previously chose, namely 0.23, this time for the TD group. Our purpose is to reach the three most resilient connections.
## number_of_resilient
## t = 0.23 9
## t = 0.26 5
## t = 0.30 2
## t = 0.33 1
## t = 0.37 1
## t = 0.40 0
We discovered that, to identify only the three most resilient connections, we have to set the threshold between 0.26 and 0.30. After some trials we found that the right threshold is 0.29.
## number_of_resilient
## t = 0.29 3
We want to represent the three most resilient connections with a graph.
In the plots above we can see the rankings of correlation and partial correlation for the TD patients, used to identify the most resilient connections. The first position is always taken by 5001-5002, whereas the other positions change. For example, the connection in second position in the left plot falls to third when using the partial correlation. The second position in the partial correlation plot is taken by 2611-2612.